
    Text-based and Signal-based Prediction of Break Indices and Pause Durations

    The relation between symbolic and signal features of prosodic boundaries is studied experimentally using prediction methods. Text-based break index prediction turns out to be fairly good, but signal-based prediction and pause duration prediction perform worse. A possible reason is that random variations of the signal features, as are usually produced by human speakers, are hard to predict.
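
    As a rough illustration of what a purely text-based break index predictor can look like, the following Python sketch assigns symbolic break indices from punctuation and a few function-word cues. The rule set, the 0-4 index scale, and all function names are illustrative assumptions, not the predictor evaluated in the paper.

def predict_break_index(word, next_word):
    """Assign a symbolic break index (0-4) to the boundary after `word`."""
    if word.endswith(('.', '!', '?')):
        return 4          # sentence-final boundary
    if word.endswith((',', ';', ':')):
        return 3          # intermediate phrase boundary
    if next_word is not None and next_word.lower() in {'and', 'or', 'but'}:
        return 2          # weak boundary before a coordinating conjunction
    return 1              # plain word boundary, no prosodic break

def predict_breaks(words):
    """Return one break index per word boundary in the utterance."""
    return [predict_break_index(w, words[i + 1] if i + 1 < len(words) else None)
            for i, w in enumerate(words)]

if __name__ == '__main__':
    tokens = 'Yesterday, however, the committee met again.'.split()
    print(list(zip(tokens, predict_breaks(tokens))))

    A trained predictor would of course add part-of-speech and phrase-length features and learn its decisions from labelled data rather than rely on hand-written rules.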

    Text Preprocessing for Speech Synthesis

    In this paper we describe our text preprocessing modules for English text-to-speech synthesis. These modules comprise rule-based text normalization, subsuming sentence segmentation and the normalization of non-standard words, statistical part-of-speech tagging, as well as statistical syllabification, grapheme-to-phoneme conversion, and word stress assignment, which rely in part on rule-based morphological analysis.
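
    The Python sketch below indicates how such a front end can be chained module by module: non-standard words are expanded first so that abbreviations do not trigger spurious sentence splits, and segmentation follows. The regular expressions, the toy expansion rules, and the function names are assumptions for illustration only; the statistical stages (POS tagging, syllabification, grapheme-to-phoneme conversion, stress assignment) are merely noted as placeholders.

import re

def normalize_non_standard_words(text):
    """Expand a few non-standard word classes (toy rules for illustration)."""
    text = re.sub(r'\bDr\.', 'Doctor', text)
    text = re.sub(r'(\d+)\s*%', r'\1 percent', text)
    return text

def segment_sentences(text):
    """Rule-based sentence segmentation on terminal punctuation."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def preprocess(text):
    """Normalize first, then segment; POS tagging, syllabification,
    grapheme-to-phoneme conversion, and stress assignment would follow here."""
    return segment_sentences(normalize_non_standard_words(text))

if __name__ == '__main__':
    print(preprocess('Dr. Smith arrived late. Turnout was 85% this year.'))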

    Towards functional modelling of relationships between the acoustics and perception of vowels

    This paper summarizes our research efforts in functional modelling of the relationship between the acoustic properties of vowels and perceived vowel quality. Our model is trained on 164 short steady-state stimuli. We measured F1 and F2, and additionally F0, since the effect of F0 on perceived vowel height is evident. 40 phonetically skilled subjects judged vowel quality using the Cardinal Vowel diagram. The main focus is on refining the model and describing its transformation properties between the F1/F2 formant chart and the Cardinal Vowel diagram. An evaluation of the model on 48 additional vowels demonstrated its generalizability and confirmed that it predicts perceived vowel quality with sufficient accuracy.
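
    A minimal sketch of how a functional mapping from F0, F1, and F2 to perceived vowel height and backness could be fitted is given below. The linear-in-log-frequency model form, the invented example values, and all names are assumptions for illustration and do not reproduce the model described in the paper.

import numpy as np

def fit_vowel_quality_model(f0, f1, f2, height, backness):
    """Least-squares fit of two linear models on log-scaled frequencies."""
    X = np.column_stack([np.ones_like(f0), np.log(f0), np.log(f1), np.log(f2)])
    w_height, *_ = np.linalg.lstsq(X, height, rcond=None)
    w_back, *_ = np.linalg.lstsq(X, backness, rcond=None)
    return w_height, w_back

def predict_quality(w_height, w_back, f0, f1, f2):
    """Map one measured vowel to predicted (height, backness)."""
    x = np.array([1.0, np.log(f0), np.log(f1), np.log(f2)])
    return float(x @ w_height), float(x @ w_back)

if __name__ == '__main__':
    # Invented values: four vowel tokens with judged height/backness on an
    # arbitrary 0..1 scale (these numbers are not measured data).
    f0 = np.array([120.0, 125.0, 118.0, 122.0])
    f1 = np.array([300.0, 700.0, 350.0, 650.0])
    f2 = np.array([2300.0, 1200.0, 800.0, 1000.0])
    height = np.array([1.0, 0.0, 1.0, 0.1])
    backness = np.array([0.0, 0.5, 1.0, 0.9])
    w_h, w_b = fit_vowel_quality_model(f0, f1, f2, height, backness)
    print(predict_quality(w_h, w_b, 121.0, 320.0, 2200.0))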

    10 Years of PhonDat-II: A Reassessment

    In this paper we conduct an evaluation as well as a reassessment of the PhonDat-II spoken language resource. Ten years after the recording of PhonDat-II, it is time to summarize and to look into its future. At present, the corpus comprises 39612 manually labelled phone tokens and 15083 syllable tokens of read German utterances. We describe the corpus in detail and then present a new method for evaluating segmentation boundaries. Finally, we ask how the PhonDat-II database can be refined in the future. The mean phone durations obtained in this study, which are based on a corrected and extended version of the PhonDat-II corpus, correspond with earlier research. Consequently, the current size of this spoken language resource appears to be sufficient for generalizing results at the segmental level.
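
    As a small example of the kind of segmental statistic reported above, the sketch below computes mean phone durations per phone type from manually labelled segments. The (start, end, label) record format and the toy values are assumptions and do not correspond to the PhonDat-II label format.

from collections import defaultdict

def mean_phone_durations(segments):
    """segments: iterable of (start_s, end_s, phone_label) tuples."""
    totals, counts = defaultdict(float), defaultdict(int)
    for start, end, label in segments:
        totals[label] += end - start
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}

if __name__ == '__main__':
    toy = [(0.00, 0.08, 'd'), (0.08, 0.21, 'a:'), (0.21, 0.27, 's'),
           (0.27, 0.33, 'd'), (0.33, 0.41, 'a:')]
    for phone, duration in mean_phone_durations(toy).items():
        print(f'{phone}: {duration * 1000:.0f} ms')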

    Reducing Segmental Duration Variation by Local Speech Rate Normalization of Large Spoken Language Resources

    We developed a time-domain normalization procedure which takes a speech signal and its corresponding speech rate contour as input and produces a speech-rate-normalized signal. We then normalized the speech rate of a large spoken language resource of German read speech. We compared the resulting segment durations with the original durations using several three-way ANOVAs with phone type and speaker as independent variables, since we assume that segment duration variation is determined by segment type (intrinsic duration), by the speaker (speech rate, sociolect, idiolect, dialect, speech production variation), and by linguistic effects (context, syllable structure, accent, and stress). One important result of the statistical analysis was that the influence of the speaker on segment duration variation decreased dramatically (factor 0.54 for vowels, factor 0.29 for consonants) when speech rate was normalized, even though sociolect, idiolect, and dialect remained almost unchanged. Since the interaction between the independent variables speaker and phone type remained nearly constant, the hypothesis arises that this interaction contains most of the speaker-specific information.
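
    The sketch below illustrates only the duration side of such a normalization: each segment duration is rescaled by the ratio of the local speech rate to the utterance-mean rate, so that local rate variation is removed from the boundary times. The interface and all names are assumptions; the actual procedure operates on the speech signal itself, not merely on segment boundaries.

import numpy as np

def normalize_durations(boundaries, rate_contour_fn):
    """boundaries: increasing segment boundary times in seconds.
    rate_contour_fn: callable returning the local speech rate at a given time."""
    boundaries = np.asarray(boundaries, dtype=float)
    durations = np.diff(boundaries)
    midpoints = (boundaries[:-1] + boundaries[1:]) / 2.0
    local_rates = np.array([rate_contour_fn(t) for t in midpoints])
    # Segments spoken faster than the mean rate are lengthened, slower ones shortened.
    new_durations = durations * local_rates / local_rates.mean()
    return np.concatenate(([boundaries[0]], boundaries[0] + np.cumsum(new_durations)))

if __name__ == '__main__':
    rate = lambda t: 4.0 + 2.0 * t   # invented contour: speech speeds up over time
    print(normalize_durations([0.0, 0.2, 0.5, 0.9, 1.2], rate))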

    The /i/-/a/-/u/-ness of Spoken Vowels

    This paper investigates acoustic, phonetic, and phonological representations of spoken vowels. For this purpose, four experiments were conducted. First, by drawing an analogy between the spectral energy distribution of vowels and the vowel space concept of Dependency Phonology, we arrive at a new, phonologically motivated vowel quality representation of spoken vowels, which we name the /i/-/a/-/u/-ness. Second, it is shown that an extension of this approach is connected to the work of Pols, van der Kamp & Plomp (1969) [1], who, among other things, predicted formant frequencies from the spectral energy distribution of vowels. Third, the vowel quality in terms of the IPA vowel diagram is derived directly from the spectral energy distribution. Finally, we compare this method with a formant- and fundamental-frequency-based approach introduced by Pfitzinger (2003) [2]. While both the /i/-/a/-/u/-ness of vowels and the perceived vowel quality prediction are quite robust, and therefore useful for signal pre-processing as well as vowel quality research, the formant prediction achieved the lowest accuracy for the mapping to the IPA vowel diagram.
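
    As a toy illustration of characterizing a vowel by its spectral energy distribution, the sketch below sums spectral energy in three broad frequency bands and normalizes the result to three coordinates. The band edges, the windowing, and all names are assumptions and do not reproduce the /i/-/a/-/u/-ness computation of the paper.

import numpy as np

def band_energy_coordinates(signal, sample_rate,
                            bands=((0, 400), (400, 1200), (1200, 3500))):
    """Return the relative spectral energy in three frequency bands."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal)))) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    energies = np.array([spectrum[(freqs >= lo) & (freqs < hi)].sum()
                         for lo, hi in bands])
    return energies / energies.sum()

if __name__ == '__main__':
    sr = 16000
    t = np.arange(0, 0.05, 1.0 / sr)
    # Crude two-component test tone (300 Hz + 2300 Hz), roughly /i/-like.
    vowel = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 2300 * t)
    print(band_energy_coordinates(vowel, sr))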